AITopics | visual perspective

Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds

Currie, Joel, Migno, Gioele, Piacenti, Enrico, Giannaccini, Maria Elena, Bach, Patric, De Tommaso, Davide, Wykowska, Agnieszka

arXiv.org Artificial Intelligence, May-21-2025

We present a conceptual framework for training Vision-Language Models (VLMs) to perform Visual Perspective Taking (VPT), a core capability for embodied cognition essential for Human-Robot Interaction (HRI). As a first step toward this goal, we introduce a synthetic dataset, generated in NVIDIA Omniverse, that enables supervised learning for spatial reasoning tasks. Each instance includes an RGB image, a natural language description, and a ground-truth 4×4 transformation matrix representing object pose. We focus on inferring Z-axis distance as a foundational skill, with future extensions targeting full 6 Degrees Of Freedom (DOFs) reasoning. The dataset is publicly available to support further research. This work serves as a foundational step toward embodied AI systems capable of spatial understanding in interactive human-robot scenarios.
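As a rough illustration of the pose representation this abstract describes, the sketch below builds a 4×4 homogeneous transformation matrix and reads the Z-axis distance from it. It is not the authors' code. The function names (`make_pose`, `z_distance`, `transform_point`) are hypothetical, and the layout is an assumption: column-vector convention, with the translation in the last column. The dataset's actual convention is not stated here and may differ (some toolchains store the translation in the last row instead).

```python
import math


def make_pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about Z plus a translation.

    Assumed layout (not confirmed by the source): column-vector convention,
    translation stored in the last column.
    """
    c, s = math.cos(rotation_z_rad), math.sin(rotation_z_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def z_distance(pose):
    """Return the Z component of the translation (row 2, column 3)."""
    return pose[2][3]


def transform_point(pose, point):
    """Apply the transform to a 3D point using homogeneous coordinates."""
    x, y, z = point
    vec = (x, y, z, 1.0)
    out = [sum(pose[r][c] * vec[c] for c in range(4)) for r in range(4)]
    return tuple(out[:3])


# Hypothetical example: an object 1.5 units away along Z, rotated 90 degrees.
pose = make_pose(math.pi / 2, (0.1, -0.2, 1.5))
print(z_distance(pose))
print(transform_point(pose, (0.0, 0.0, 0.0)))
```

Under this assumed layout, the Z-axis distance is a single matrix entry, while a full 6-DoF estimate would also need the rotation encoded in the upper-left 3×3 block.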

artificial intelligence, spatial reasoning, visual perspective, (12 more...)

2505.14366

Country:

North America > United States (0.14)
Europe > United Kingdom (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.60)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.39)

Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Góral, Gracjan, Ziarko, Alicja, Miłoś, Piotr, Nauman, Michał, Wołczyk, Maciej, Kosiński, Michał

arXiv.org Artificial Intelligence, May-8-2025

We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a novel set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes, in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations, such as object position relative to the humanoid minifigure and the humanoid minifigure's orientation, and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each visual task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. Our evaluation of several state-of-the-art models, including GPT-4-Turbo, GPT-4o, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, reveals that while they excel in scene understanding, their performance declines significantly on spatial reasoning and deteriorates further on perspective taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols in future VLM development.

humanoid minifigure, large language model, machine learning, (16 more...)

2505.03821

Country: North America > United States > California (0.46)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

Ghosh, Sreyan, Kong, Zhifeng, Kumar, Sonal, Sakshi, S, Kim, Jaehyeon, Ping, Wei, Valle, Rafael, Manocha, Dinesh, Catanzaro, Bryan

arXiv.org Artificial Intelligence, Mar-5-2025

Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: https://research.nvidia.com/labs/adlr/AF2/.

background, classification, visual perspective, (16 more...)

2503.03983

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.45)

Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models

Góral, Gracjan, Ziarko, Alicja, Nauman, Michal, Wołczyk, Maciej

arXiv.org Artificial Intelligence, Sep-2-2024

Visual perspective-taking (VPT), the ability to understand the viewpoint of another person, enables individuals to anticipate the actions of other people. For instance, a driver can avoid accidents by assessing what pedestrians see. Humans typically develop this skill in early childhood, but it remains unclear whether the recently emerging Vision Language Models (VLMs) possess such a capability. Furthermore, as these models are increasingly deployed in the real world, understanding how they perform nuanced tasks like VPT becomes essential. In this paper, we introduce two manually curated datasets, Isle-Bricks and Isle-Dots, for testing VPT skills, and we use them to evaluate 12 commonly used VLMs. Across all models, we observe a significant performance drop when perspective-taking is required. Additionally, we find that performance in object detection tasks is poorly correlated with performance on VPT tasks, suggesting that the existing benchmarks might not be sufficient to understand this problem. The code and the dataset will be available at https://sites.google.com/view/perspective-taking
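The correlation claim in this abstract can be made concrete with a small sketch. The code below computes a Pearson correlation coefficient between per-model scores on two tasks. It is an illustration only: the abstract does not say which statistic the authors used, the function name `pearson` is hypothetical, and the score lists are made-up placeholders, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder accuracies for four imaginary models (NOT data from the paper).
detection_scores = [0.90, 0.80, 0.95, 0.70]
vpt_scores = [0.40, 0.60, 0.30, 0.50]
print(pearson(detection_scores, vpt_scores))
```

A coefficient near zero would indicate that a model's object detection score says little about its VPT score, which is the kind of weak relationship the abstract reports.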

claude 3, dataset, language model, (13 more...)

2409.12969

Country:

North America > United States > Nebraska (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Poland > Masovia Province > Warsaw (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Visual Perspective Taking for Opponent Behavior Modeling

Chen, Boyuan, Hu, Yuhang, Kwiatkowski, Robert, Song, Shuran, Lipson, Hod

arXiv.org Artificial Intelligence, May-11-2021

In order to engage in complex social interaction, humans learn at a young age to infer what others see and cannot see from a different point of view, and learn to predict others' plans and behaviors. These abilities have been mostly lacking in robots, sometimes making them appear awkward and socially inept. Here we propose an end-to-end long-term visual prediction framework for robots to begin to acquire both these critical cognitive skills, known as Visual Perspective Taking (VPT) and Theory of Behavior (TOB). We demonstrate our approach in the context of visual hide-and-seek, a game that represents a cognitive milestone in human development. Unlike traditional visual predictive models that generate new frames from immediate past frames, our agent can directly predict to multiple future timestamps (25s), extrapolating by 175% beyond the training horizon. We suggest that visual behavior modeling and perspective taking skills will play a critical role in the ability of physical robots to fully integrate into real-world multi-agent activities. Our website is at http://www.cs.columbia.edu/~bchen/vpttob/.

prediction, representation, robot, (14 more...)

2105.05145

Country: North America > United States > Nebraska (0.04)

Genre:

Research Report (0.64)
Workflow (0.48)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.68)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Study finds 'lower class' groups are better at reading emotions than the 'higher class'

Daily Mail - Science & tech, Sep-4-2020, 18:49:27 GMT

Those deemed to be in the higher class may be envied for their luxurious cars, large homes and stylish clothes, but there is one thing they do not have: the ability to read people's emotions. A study used a cognitive empathy test called 'Reading the Mind in the Eyes,' in which participants from higher and lower social classes were asked to determine emotional states from images of eyes, and the team calculated the scores. The results showed those in the lower class were better at understanding other people's minds compared to their counterparts. Experts suggest the reason is that lower social classes tend to prioritize the needs and preferences of others, and are ultimately more empathetic. The study was conducted by a team at the University of California, Irvine, who asked, 'How does access to resources (e.g., money, education) influence the way we process information about other human beings?', PsyPost reported.

artificial intelligence, lower class, social class, (17 more...)

Country:

North America > United States > California > Orange County > Irvine (0.27)
North America > United States > New York (0.06)

Technology: Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.40)
